Skip to content

Coalesce HTTP COG range reads and parse IFDs once per dask graph#1534

Merged
brendancol merged 2 commits intoxarray-contrib:mainfrom
brendancol:perf/http-cog-range-coalescing
May 9, 2026
Merged

Coalesce HTTP COG range reads and parse IFDs once per dask graph#1534
brendancol merged 2 commits intoxarray-contrib:mainfrom
brendancol:perf/http-cog-range-coalescing

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Two related fixes for the HTTP COG read path, surfaced by the recent geotiff performance audit (P2 + P5).

P2 -- merge adjacent tile ranges into fewer GETs. _read_cog_http previously fired one GET Range: per tile through an 8-worker pool, so wall time scaled as ceil(N_tiles / 8) * RTT. COG tiles sit sequentially in the file, so a new coalesce_ranges helper merges adjacent (offset, length) ranges whose gap is below a threshold (default 1 MB, configurable via XRSPATIAL_COG_COALESCE_GAP). split_coalesced_bytes slices the returned bytes back per-tile, and _HTTPSource.read_ranges_coalesced wraps the existing read_ranges so the threadpool model is unchanged.

The 1 MB gap threshold is chosen empirically: most compressed COG tiles are well under 1 MB and tile rows are stored back-to-back, so the threshold tolerates small interleaved metadata without dragging in unrelated overview data. Set the env var to -1 to disable merging entirely (the legacy behaviour) -- this is what the new perf test uses to measure the baseline.

P5 -- parse IFDs once per dask graph. read_geotiff_dask against an HTTP URL used to call read_to_array per chunk, which routed to _read_cog_http and fired a fresh 16 KB header GET every time. _read_cog_http is now split into _parse_cog_http_meta (one IFD parse) and _fetch_decode_cog_http_tiles (window-aware tile fetch + decode). read_geotiff_dask parses metadata once before constructing the graph and threads the parsed (TIFFHeader, IFD) into the delayed tasks via a new internal http_meta kwarg.

Public API is unchanged. read_geotiff_dask, _HTTPSource.read_ranges, and read_to_array all keep their existing signatures.

Reproduction numbers

Mocked 50 ms RTT, 64-tile COG:

wall time range GETs
baseline (XRSPATIAL_COG_COALESCE_GAP=-1) ~450 ms 65
coalesced (default 1 MB) ~100 ms 2

16-chunk dask graph against the same mocked HTTP COG:

  • before P5: 17 header GETs (one per chunk + one for the eager metadata read)
  • after P5: 1 header GET

Test plan

  • pytest xrspatial/geotiff/tests/test_http_cog_coalesce.py -x -q (11 new tests, all pass)
  • pytest xrspatial/geotiff/tests/test_cog.py xrspatial/geotiff/tests/test_cog_http_concurrent.py xrspatial/geotiff/tests/test_sparse_cog.py -x -q (36 tests, all pass)
  • Full geotiff test suite (795 pass, 4 skipped). The 3 failing tests in test_features.py::TestPalette predate this change (recursion in palette plotting).

Two related performance fixes for the HTTP COG read path.

P2 -- range coalescing. _read_cog_http used to fire one GET Range:
request per tile through an 8-worker pool, so wall time scaled as
ceil(N_tiles / 8) * RTT. COG tiles are stored sequentially, so a new
helper coalesce_ranges merges adjacent (offset, length) entries whose
gap is below a threshold (default 1 MB, configurable via
XRSPATIAL_COG_COALESCE_GAP) and split_coalesced_bytes slices the
returned bytes back per-tile. _HTTPSource grows a read_ranges_coalesced
wrapper that calls the existing read_ranges underneath. On a 50 ms
RTT mocked link with 64 tiles the un-coalesced path takes ~450 ms; the
coalesced path takes ~100 ms and issues 2 GETs instead of 65.

P5 -- once-per-graph IFD parsing. read_geotiff_dask used to call
read_to_array per chunk, which on HTTP routed through _read_cog_http
and fired a fresh 16 KB header GET each time. The IFD parse is now
factored into _parse_cog_http_meta and the tile fetch into
_fetch_decode_cog_http_tiles (which honours a window). read_geotiff_dask
parses metadata once before constructing the dask graph and threads the
parsed (header, ifd) into delayed tasks via an http_meta kwarg. A
16-chunk dask graph now issues a single 16 KB header GET instead of
one per chunk.

Public API is unchanged. Existing test_cog_http_concurrent suite still
passes; new tests live in test_http_cog_coalesce.py.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 9, 2026 15:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the GeoTIFF HTTP COG read path to reduce round-trips and redundant metadata reads, improving performance for both eager reads and dask-chunked reads.

Changes:

  • Add HTTP range coalescing utilities (coalesce_ranges, split_coalesced_bytes) and _HTTPSource.read_ranges_coalesced to merge adjacent tile byte ranges into fewer GETs.
  • Split _read_cog_http into _parse_cog_http_meta (metadata/IFD parse) and _fetch_decode_cog_http_tiles (tile fetch+decode), enabling reuse of parsed metadata.
  • Update read_geotiff_dask to parse HTTP COG metadata once and thread it through delayed window reads; add a new focused test suite for coalescing + “parse IFD once per graph”.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/_reader.py Implements range coalescing and separates HTTP COG metadata parsing from tile fetching/decoding (enabling reuse).
xrspatial/geotiff/__init__.py Reworks read_geotiff_dask HTTP path to reuse parsed COG metadata across chunk tasks.
xrspatial/geotiff/tests/test_http_cog_coalesce.py Adds unit + integration tests validating coalescing behavior and reduced header-GET count in dask.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_reader.py Outdated
Comment thread xrspatial/geotiff/__init__.py
Comment thread xrspatial/geotiff/_reader.py
Three findings, all acted on:

- _parse_cog_http_meta retry comment claimed "If we didn't get all
  IFDs, try a larger fetch" but the gate is `len(ifds) == 0`, not
  partial-IFD recovery. Reword the comment to match: parse_all_ifds
  bails when the header GET lands short of the first IFD offset,
  and the retry expands the window in that one case. Overviews
  past the first 16 KiB still come from later reads.
- _read_tiles now skips the full-image _check_dimensions(width,
  height, ...) gate when a window is provided. With the windowed
  HTTP path, dask-chunked reads of multi-billion-pixel COGs only
  materialise the window; the full-image cap was rejecting
  legitimate reads. The single-tile budget always applies, and the
  full-image cap still applies for whole-file reads (window is
  None) and the windowed output cap (out_h * out_w * samples) is
  unchanged.
- read_geotiff_dask wraps the parsed (TIFFHeader, IFD) in a single
  dask.delayed key (`http_meta_key`) and threads it as a function
  argument into _delayed_read_window's inner @dask.delayed task,
  rather than capturing it via Python closure. Distributed/process
  schedulers now serialise the metadata once across the whole
  graph instead of once per chunk. On COGs with millions of tiles
  (and correspondingly large TileOffsets/TileByteCounts tuples in
  the IFD) this saves O(N_chunks) of redundant pickle work.
@brendancol brendancol merged commit 855de4c into xarray-contrib:main May 9, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants